ONT 16S Sequencing Files
Raw fast5 files from each Oxford Nanopore (ONT) sequencing batch can
be found within the directory named: raw_fast5_files/
A total of 8 sequencing batches were run. Each
batch includes 24 samples from the following treatment and donor
ids:
      1. UT1, UT2, UT3
      2. UT4, UT5, UT6
      3. PLA1, PLA2, PLA3
      4. PLA4, PLA5, PLA6 *NOTE: These samples
had to be re-run so they represent two separate sequencing runs
      5. T1-1, T1-2, T1-3
      6. T1-4, T1-5, T1-6
      7. T2-1, T2-2, T2-3
      8. T2-4, T2-5, T2-6
So for example, the raw fast5 files of samples
within UT1, UT2, UT3 would be in:
raw_fast5_files/UT1_UT2_UT3_fast5s/
ONT 16S Sequence QC Pipeline
After ONT sequencing the following pipeline was
used to demultiplex and quality filter reads:
      Chopper parameters used:
          • Min q score: 10
          • Min length: 1,000 bp
          • Head crop: 50
          • Max length: 1690
4. Multiqc results summary using fastp qc V
0.20.1
   Multiqc results for each sequencing batch can be found in the
directory named: multiqc
Statistical Analysis
Diversity, differential abundance, and relative abundance of taxa
were analyzed via Phyloseq and statistical tests were run between the 4
treatments.
The biom file and metadata table imported into phyloseq can be found
in the directory named: stats_analysis
Results include:
o Alpha diversity metrics - Observed, Shannon,
Simpson (csv tables in stats_analysis directory)
o Alpha diversity statistical tests (results
below)
o Alpha diversity boxplots (tiff images in
stats_analysis/alpha_diversity_plots directory)
      • *NOTE: Observed richness was run on raw
and normalized data. Rarefaction to lowest read number was used as
normalization.
      This was included for comparative purposes. Other normalization
methods can also be run if requested.
o Pairwise differential abundance test via
DeSeq2 (csv table results in stats_analysis/DESEQ2 directory)
      • Volcano plot (tiff image in
stats_analysis/DESEQ2 directory)
o Beta diversity statistical tests -
Bray-Curtis, Euclidean distances
o Beta diversity plots (tiff images in
stats_analysis/beta_diversity_plots directory)
o Relative abundance taxa plots (tiff images can
be found in stats_analysis/taxaplots directory)
#Libraries we will use
library("DESeq2")
library("ggplot2")
library(Biostrings)
library("phyloseq")
library("microbiome")
library("ggthemes")
library(ggnetwork)
library(cowplot)
library("vegan")
library(magrittr)
library(dplyr)
library(EnhancedVolcano)
#Create a few output directories that will hold the output files.
dir.create('./DESEQ2')
dir.create('./ANCOM')
# Load data and add metadata
ps_object <- import_biom("./HOMD_biom_merge.biom")
#import medata and add to phyloseq object
metadata <- read.csv("./metadata.csv", row.names = 1)
metadata = sample_data(data.frame(metadata))
sample_data(ps_object) <- metadata
Statistical Analysis Results
Total number of taxa
The total number of taxonomically identified ASVs across all
samples
#Check sample numbers
ntaxa(ps_object) #Total taxa
[1] 535
Number of taxonomically classified reads per sample
Samples will have varying numbers of taxonomically classified reads
due to different sampling depths during sequencing.
#Check sample numbers
sample_sums(ps_object) #Reads per sample
PLA-6G.trim.fastq_bracken PLA-5F.trim.fastq_bracken PLA-6D.trim.fastq_bracken PLA-6E.trim.fastq_bracken
122253 134537 151338 160823
PLA-4G.trim.fastq_bracken T1-6D.trim.fastq_bracken PLA-5H.trim.fastq_bracken PLA-6F.trim.fastq_bracken
134896 252097 164959 111641
T1-6E.trim.fastq_bracken PLA-4F.trim.fastq_bracken T1-5E.trim.fastq_bracken PLA-4D.trim.fastq_bracken
218460 261085 327649 193079
T1-5F.trim.fastq_bracken T1-5D.trim.fastq_bracken T1-4A.trim.fastq_bracken PLA-5A.trim.fastq_bracken
269287 296705 332687 141418
PLA-5G.trim.fastq_bracken T1-6H.trim.fastq_bracken PLA-5B.trim.fastq_bracken T1-5G.trim.fastq_bracken
244555 219936 271721 281291
PLA-6B.trim.fastq_bracken T1-6G.trim.fastq_bracken PLA-4C.trim.fastq_bracken PLA-4A.trim.fastq_bracken
196991 217395 308241 314998
T1-4H.trim.fastq_bracken T1-6F.trim.fastq_bracken T1-6A.trim.fastq_bracken T1-4B.trim.fastq_bracken
276626 205810 229874 340421
T1-5H.trim.fastq_bracken PLA-4E.trim.fastq_bracken T1-6C.trim.fastq_bracken T1-5C.trim.fastq_bracken
307412 185855 202763 291410
PLA-6H.trim.fastq_bracken T2-4A.trim.fastq_bracken T2-6F.trim.fastq_bracken T2-5D.trim.fastq_bracken
124695 177019 121162 149516
PLA-5C.trim.fastq_bracken T1-4G.trim.fastq_bracken T2-4C.trim.fastq_bracken T2-4B.trim.fastq_bracken
115282 229188 175811 166040
T2-5A.trim.fastq_bracken PLA-6C.trim.fastq_bracken PLA-5D.trim.fastq_bracken PLA-5E.trim.fastq_bracken
143238 149583 148025 147511
T2-4G.trim.fastq_bracken PLA-4B.trim.fastq_bracken T1-6B.trim.fastq_bracken T2-5E.trim.fastq_bracken
122305 307852 233657 230137
T2-5C.trim.fastq_bracken T2-4F.trim.fastq_bracken T2-6H.trim.fastq_bracken T2-6E.trim.fastq_bracken
135595 136812 122927 114585
T2-6G.trim.fastq_bracken PLA-4H.trim.fastq_bracken T2-5F.trim.fastq_bracken T2-6C.trim.fastq_bracken
111199 232902 117429 99830
T2-6A.trim.fastq_bracken T2-4H.trim.fastq_bracken T2-5G.trim.fastq_bracken T1-1E.trim.fastq_bracken
134098 139078 141449 129593
T1-4F.trim.fastq_bracken T2-4E.trim.fastq_bracken T2-6D.trim.fastq_bracken T2-5B.trim.fastq_bracken
291101 158881 111618 156670
T2-5H.trim.fastq_bracken T1-4E.trim.fastq_bracken T1-5A.trim.fastq_bracken T1-5B.trim.fastq_bracken
146811 320326 295293 335887
T1-2C.trim.fastq_bracken PLA-2A.trim.fastq_bracken T1-1H.trim.fastq_bracken T1-2F.trim.fastq_bracken
112517 129688 112481 111566
T1-3D.trim.fastq_bracken T1-1A.trim.fastq_bracken T1-1D.trim.fastq_bracken T1-1B.trim.fastq_bracken
137279 156574 156381 142894
T1-3B.trim.fastq_bracken T1-3E.trim.fastq_bracken T1-3F.trim.fastq_bracken T1-2E.trim.fastq_bracken
143554 136304 126753 137682
T1-2G.trim.fastq_bracken T1-2D.trim.fastq_bracken T2-6B.trim.fastq_bracken PLA-6A.trim.fastq_bracken
138687 135530 107692 221393
T2-4D.trim.fastq_bracken T1-4D.trim.fastq_bracken T1-4C.trim.fastq_bracken T1-1G.trim.fastq_bracken
186787 516674 347729 97588
T1-2H.trim.fastq_bracken T1-1F.trim.fastq_bracken T1-2B.trim.fastq_bracken T1-1C.trim.fastq_bracken
132909 100480 169725 150087
T1-3G.trim.fastq_bracken T1-3H.trim.fastq_bracken T1-2A.trim.fastq_bracken T1-3C.trim.fastq_bracken
132758 137643 121393 134268
PLA-1H.trim.fastq_bracken T1-3A.trim.fastq_bracken PLA-1A.trim.fastq_bracken PLA-1C.trim.fastq_bracken
113231 164631 150169 136540
PLA-3D.trim.fastq_bracken PLA-3C.trim.fastq_bracken PLA-3A.trim.fastq_bracken PLA-3E.trim.fastq_bracken
128055 127764 152664 137933
PLA-3H.trim.fastq_bracken PLA-1G.trim.fastq_bracken UT-5C.trim.fastq_bracken PLA-2F.trim.fastq_bracken
139049 113533 142315 116570
PLA-2D.trim.fastq_bracken PLA-2C.trim.fastq_bracken PLA-3B.trim.fastq_bracken PLA-2E.trim.fastq_bracken
119736 124253 121101 141694
PLA-1E.trim.fastq_bracken PLA-2H.trim.fastq_bracken PLA-3G.trim.fastq_bracken PLA-1B.trim.fastq_bracken
139254 124653 134659 139846
PLA-2G.trim.fastq_bracken PLA-3F.trim.fastq_bracken UT-6B.trim.fastq_bracken UT-6G.trim.fastq_bracken
131090 124131 149386 157610
UT-4B.trim.fastq_bracken UT-5E.trim.fastq_bracken PLA-2B.trim.fastq_bracken UT-4A.trim.fastq_bracken
188401 204864 147226 232458
UT-4D.trim.fastq_bracken PLA-1D.trim.fastq_bracken UT-4G.trim.fastq_bracken PLA-1F.trim.fastq_bracken
230691 170659 146773 114481
UT-4E.trim.fastq_bracken UT-6E.trim.fastq_bracken UT-6F.trim.fastq_bracken UT-5D.trim.fastq_bracken
180292 160952 150999 161566
UT-5F.trim.fastq_bracken UT-5B.trim.fastq_bracken UT-6C.trim.fastq_bracken UT-4H.trim.fastq_bracken
162768 194116 137967 148853
UT-5G.trim.fastq_bracken UT-5A.trim.fastq_bracken UT-4C.trim.fastq_bracken UT-4F.trim.fastq_bracken
155111 166610 210096 174867
UT-5H.trim.fastq_bracken UT-6H.trim.fastq_bracken T2-2C.trim.fastq_bracken T2-2H.trim.fastq_bracken
171615 155942 224131 236759
T2-1F.trim.fastq_bracken T2-3D.trim.fastq_bracken UT-6A.trim.fastq_bracken UT-6D.trim.fastq_bracken
220654 226536 150983 154152
T2-2D.trim.fastq_bracken T2-1G.trim.fastq_bracken T2-1D.trim.fastq_bracken T2-3G.trim.fastq_bracken
212920 196370 254486 241081
T2-3C.trim.fastq_bracken T2-1C.trim.fastq_bracken T2-2B.trim.fastq_bracken T2-1B.trim.fastq_bracken
244282 288104 247158 265356
T2-3A.trim.fastq_bracken T2-3F.trim.fastq_bracken T2-2E.trim.fastq_bracken UT3D.trim.fastq_bracken
254775 249430 228892 193130
T2-3B.trim.fastq_bracken T2-2G.trim.fastq_bracken T2-3E.trim.fastq_bracken T2-1A.trim.fastq_bracken
258304 236985 230370 239366
T2-2F.trim.fastq_bracken T2-1E.trim.fastq_bracken T2-1H.trim.fastq_bracken T2-2A.trim.fastq_bracken
197203 251642 212102 219040
UT3B.trim.fastq_bracken UT2C.trim.fastq_bracken UT2B.trim.fastq_bracken UT1E.trim.fastq_bracken
303307 119023 90053 95039
UT1H.trim.fastq_bracken UT2A.trim.fastq_bracken T2-3H.trim.fastq_bracken UT1F.trim.fastq_bracken
85347 94188 246493 103031
UT1G.trim.fastq_bracken UT1B.trim.fastq_bracken UT2H.trim.fastq_bracken UT1A.trim.fastq_bracken
90941 251468 131175 215807
UT2E.trim.fastq_bracken UT3C.trim.fastq_bracken UT2G.trim.fastq_bracken UT3G.trim.fastq_bracken
174982 159762 230766 293498
UT1D.trim.fastq_bracken UT2F.trim.fastq_bracken UT3H.trim.fastq_bracken UT3E.trim.fastq_bracken
146254 257994 263981 195541
UT2D.trim.fastq_bracken UT3F.trim.fastq_bracken UT3A.trim.fastq_bracken UT1C.trim.fastq_bracken
255864 276688 284484 290816
Alpha Diversity
Measuring species diversity within each sample.
#Function to calculate the mean and the standard deviation for each group.
#data : a data frame
#varname : the name of a column containing the variable to be summarized
#groupnames : vector of column names to be used as grouping variables
# Calculate standard error
sderr <- function(x) {sd(x)/sqrt(length(x))}
data_summary <- function(data, varname, groupnames){
require(plyr)
summary_func <- function(x, col){
c(mean = mean(x[[col]], na.rm=TRUE),
sd = sderr(x[[col]]), na.rm=TRUE)
}
data_sum<-ddply(data, groupnames, .fun=summary_func,
varname)
data_sum <- rename(data_sum, c("mean" = varname))
return(data_sum)
}
Data table with alpha diversity metrics: Observed, Shannon,
Simpson
Example of data table structure (full table saved as csv)
Three alpha diversity metrics are calculated for each sample. The
full table (per_sample_ad_dataframe.csv) can be found in the directory
named: stats_analysis
alpha_div <- cbind(estimate_richness(ps_object,
measures = c('Observed', 'Shannon', 'Simpson')),
sample_data(ps_object))
colnames(alpha_div) <- c('Observed', 'Shannon', 'Simpson')
alpha_div$Labels <- rownames(alpha_div)
ad_df <- alpha_div[,c('Observed', 'Shannon', 'Simpson')]
ad_df <- cbind(ad_df,
sample_data(ps_object)[, c('sample', 'group', 'donor', 'treatment')])
colnames(ad_df) <- c('Observed', 'Shannon', 'Simpson', 'sample', 'group', 'donor', 'treatment')
head(ad_df)
write.csv(ad_df, "./per_sample_ad_dataframe.csv")
Data Summary statistics: Observed, Shannon, Simpson
Summary for each treatment
The average of each alpha diversity metric for all samples within a
treatment.
# Observed
data_summary(ad_df, varname = "Observed", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Observed", groupnames = c("sample"))
#data_summary(ad_df, varname = "Observed", groupnames = c("donor"))
# Shannon
data_summary(ad_df, varname = "Shannon", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Shannon", groupnames = c("sample"))
#data_summary(ad_df, varname = "Shannon", groupnames = c("donor"))
# Simpson
data_summary(ad_df, varname = "Simpson", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Simpson", groupnames = c("sample"))
#data_summary(ad_df, varname = "Simpson", groupnames = c("donor"))
Alpha diversity statistical tests for significance
Kruskal Wallis and Pairwise Wilcoxon tests
Kruskal-Wallis Rank Sum Test for a category (treatment) on each Alpha
Diversity metric calculated. This is a non-parametric method which ranks
the data and tests whether samples from each group analyzed come from
the same distribution.
Pairwise Wilcoxon Rank Sum Test between groups within a category
(treatment) on each Alpha Diversity metric calculated and corrected for
multiple testing. This methods compares two independent samples for all
group levels and includes the false discovery rate (FDR) multiple test
correction. The results below come up as a matrix filled with the
p-values for each treatment comparison.
Observed Richness
Testing ASV/species richness between the 4 treatments.
#### Observed
#Kruskal Wallis tests
kruskal.test(Observed ~ treatment, data = ad_df)
Kruskal-Wallis rank sum test
data: Observed by treatment
Kruskal-Wallis chi-squared = 11.944, df = 3, p-value = 0.007577
#kruskal.test(Observed ~ sample, data = ad_df)
#kruskal.test(Observed ~ donor, data = ad_df)
#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Observed, ad_df$treatment, p.adjust.method = 'fdr')
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: ad_df$Observed and ad_df$treatment
1.5R 8R Placebo
8R 0.849 - -
Placebo 0.179 0.257 -
Untreated 0.179 0.011 0.011
P value adjustment method: fdr
Shannon Diversity
Testing ASV/species richness between the 4 treatments while also
taking into account relative abundance (evenness).
#### Shannon
#Kruskal Wallis tests
kruskal.test(Shannon ~ treatment, data = ad_df)
Kruskal-Wallis rank sum test
data: Shannon by treatment
Kruskal-Wallis chi-squared = 6.9716, df = 3, p-value = 0.07281
#kruskal.test(Shannon ~ sample, data = ad_df)
#kruskal.test(Shannon ~ donor, data = ad_df)
#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Shannon, ad_df$treatment, p.adjust.method = 'fdr')
Pairwise comparisons using Wilcoxon rank sum exact test
data: ad_df$Shannon and ad_df$treatment
1.5R 8R Placebo
8R 0.73 - -
Placebo 0.92 0.64 -
Untreated 0.11 0.17 0.10
P value adjustment method: fdr
Simpson Diversity
Testing ASV/species richness between the 4 treatments while also
taking into account relative abundance (evenness). Simpson diversity
gives more weight to common or dominant ASVs (rare ASVs will not affect
diversity).
#### Simpson
#Kruskal Wallis tests
kruskal.test(Simpson ~ treatment, data = ad_df)
Kruskal-Wallis rank sum test
data: Simpson by treatment
Kruskal-Wallis chi-squared = 3.2214, df = 3, p-value = 0.3587
#kruskal.test(Simpson ~ sample, data = ad_df)
#kruskal.test(Simpson ~ donor, data = ad_df)
#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Simpson, ad_df$treatment, p.adjust.method = 'fdr')
Pairwise comparisons using Wilcoxon rank sum exact test
data: ad_df$Simpson and ad_df$treatment
1.5R 8R Placebo
8R 0.60 - -
Placebo 0.80 0.60 -
Untreated 0.60 0.60 0.48
P value adjustment method: fdr
Alpha diversity boxplots
Images of each plot can be found in:
stats_analysis/alpha_diversity_plots
###Group plots
#Observed
obs_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Observed, color = treatment)) + geom_boxplot() + geom_point() + ylab('Observed Richness') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))
obs_ad_plot

ggsave("obs_ad_plot.tiff", dpi = 300, plot = obs_ad_plot, path = "./alpha_diversity_plots/")
#Shannon
shan_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Shannon, color = treatment)) + geom_boxplot() + geom_point() + ylab('Shannon Diversity') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))
shan_ad_plot

ggsave("shan_ad_plot.tiff", dpi = 300, plot = shan_ad_plot, path = "./alpha_diversity_plots/")
#Simpson
simp_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Simpson, color = treatment)) + geom_boxplot() + geom_point() + ylab('Simpson Diversity') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))
simp_ad_plot

ggsave("simp_ad_plot.tiff", dpi = 300, plot = simp_ad_plot, path = "./alpha_diversity_plots/")
Alpha Diversity on Rarefied Data
Normalize Observed Richness metric by rarefying to lowest sample sum
number
Shannon and Simpson diversity indices are already normalized by
relative abundance.
Rarefaction read depth
One method of normalization is to rarefy each sample to the minimum
read depth found across all samples. The minimum read depth for all
samples is:
min(sample_sums(ps_object)) #Minimum reads (used for rarefying)
[1] 85347
Example of datatable structure for rarefied data observed richness
(full table saved as csv)
Observed richness is calculated for each sample now rarefied to the
minimum read depth. The full table
(rarefied_per_sample_ad_dataframe.csv) can be found in the directory
named: stats_analysis
#Rarefy to lowest sample_sums
rare_num <- min(sample_sums(ps_object))
ps_rare <- rarefy_even_depth(ps_object, sample.size = rare_num, rngseed = 999)
#Get datatable with Observed richness for rarefied samples
alpha_div_rare <- cbind(estimate_richness(ps_rare,
measures = c('Observed')),
sample_data(ps_rare))
colnames(alpha_div_rare) <- c('Observed_rare')
alpha_div_rare$Labels <- rownames(alpha_div_rare)
ad_df_rare <- alpha_div_rare[,c('Observed_rare')]
ad_df_rare <- cbind(ad_df_rare,
sample_data(ps_rare)[, c('sample', 'group', 'donor', 'treatment')])
colnames(ad_df_rare) <- c('Observed_rare', 'sample', 'group', 'donor', 'treatment')
head(ad_df_rare)
write.csv(ad_df_rare, "./rarefied_per_sample_ad_dataframe.csv")
Data summary for rarefied data (observed richness)
The average observed richness for all samples within a treatment.
# Observed Rarefied data summary stats
data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("treatment"))
#data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("sample"))
#data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("donor"))
Observed richness statistical tests for rarefied data
Testing ASV/species richness between the 4 treatments.
#### Observed Rarefied
#Kruskal Wallis tests
kruskal.test(Observed_rare ~ treatment, data = ad_df_rare)
Kruskal-Wallis rank sum test
data: Observed_rare by treatment
Kruskal-Wallis chi-squared = 12.925, df = 3, p-value = 0.004801
#kruskal.test(Observed_rare ~ sample, data = ad_df_rare)
#kruskal.test(Observed_rare ~ donor, data = ad_df_rare)
#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df_rare$Observed_rare, ad_df_rare$treatment, p.adjust.method = 'fdr')
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: ad_df_rare$Observed_rare and ad_df_rare$treatment
1.5R 8R Placebo
8R 0.6733 - -
Placebo 0.3884 0.6733 -
Untreated 0.0635 0.0066 0.0066
P value adjustment method: fdr
Rarefied data observed richness boxplot
Images of the plot can be found in:
stats_analysis/alpha_diversity_plots
#Plots: Observed Rarefied
obs_ad_plot_rare <- ggplot(ad_df_rare, aes(x=treatment, y=Observed_rare, color = treatment)) + geom_boxplot() + geom_point() + ylab('Observed Richness Rarefied Data') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))
obs_ad_plot_rare

ggsave("rarefied_obs_ad_plot.tiff", dpi = 300, plot = obs_ad_plot_rare, path = "./alpha_diversity_plots/")
Differential Abundance Analysis
Pairwise differential abundance test between treatments via
DESeq2
The package DESeq2 provides methods to test for differential
abundance by use of negative binomial generalized linear models. DESeq2
uses the Wald test to take into account the dispersion model distance
and negative binomial distribution we see in sequencing data. Overall
this method statistically infers systematic changes between conditions,
as compared to within-condition variability. It estimates dispersion and
logarithmic fold changes to test which sequences/ASVs are significantly
changing within the experimental design.
Additionally, a p-value correction is applied for multiple hypothesis
testing. Traditionally p < 0.05 is considered significant (95% chance
it is not by random chance), but when you are looking at thousands of
ASVs/sequences, you’ll still end up with 5% of them significant off
predicted chance. One of the best corrections to use called
False-Discovery Rate correction (FDR), which is the Benjamini and
Hochberg method.
Here we ran DESeq2 specifying only treatment in the experimental
design. Log fold changes and adjusted p-valus (padj) for all ASVs and
significant ASVs have been saved as csv tables in the
stats_analysis/DESEQ2 directory (named dds_result_table.csv and
sig_result_table.csv respectively).
The below plot shows the fitted curve (red line) used to calculate
the final dispersion estimates (blue dots) of the input data (black
dots).
#Phyloseq to Deseq2
ps_deseq = phyloseq_to_deseq2(ps_object, ~treatment)
#Run Deseq2
dds = DESeq(ps_deseq, test="Wald", fitType="local")
dds_result <- results(dds)
dds_result_table <- cbind(as(dds_result, "data.frame"), as(tax_table(ps_object)[rownames(dds_result), ], "matrix"))
write.csv(dds_result_table, "./DESEQ2/dds_result_table.csv")
#check significant results
dds_sig = dds_result[which(dds_result$padj < 0.05), ]
dds_sig_table = cbind(as(dds_sig, "data.frame"), as(tax_table(ps_object)[rownames(dds_sig), ], "matrix"))
dim(dds_sig_table)
write.csv(dds_sig_table, "./DESEQ2/sig_result_table.csv")
#Perform the variance stabilizing transformation on the data and pull the normalized hit count matrix.
vst <- varianceStabilizingTransformation(dds)
mat <- assay(vst)
#If you have a batch designed in, regress the batch effects.
#mat <- limma::removeBatchEffect(mat, vst$batch)
#assay(vst) <- mat
#Plot the overall dispersion of the experiment to ensure it aligns with expectations.
plotDispEsts(dds)

Volcano plot: DeSeq2 results
Volcano plots are a great way to visualize the pairwise comparisons.
Each shows the log2 fold-change (log2FC) on the x-axis and -log10(padj)
on the y-axis. This should usually look akin to a Plinian eruption
(hence volcano plot).
Image of the volcano plot can be found in the stats_analysis/DESEQ2/
directory
p <- EnhancedVolcano(dds_result_table,
lab=as.character(dds_result_table$Rank9),
title="Pairwise comparisons of taxa (species level) between treatments",
x='log2FoldChange',
y='padj',
legendPosition = 'bottom',
legendLabels=c('Not sig.','Log (base 2) FC','padj-value','padj-value & Log (base 2) FC'),
pCutoff=0.05,
FCcutoff=1
)
print(p)

ggsave("volcano_plot.tiff", dpi = 300, plot = p, path = "./DESEQ2/")
Beta Diversity
Measuring species diversity or the level or similarity/dissimilarity
of species between samples.
#Pairwise Adonis function
pairwise.adonis.dm <- function(x,factors,stratum=NULL,p.adjust.m="fdr",perm=999){
library(vegan)
if(class(x) != "dist") stop("x must be a dissimilarity matrix (dist object)")
co = as.matrix(combn(unique(factors),2))
pairs = c()
F.Model =c()
R2 = c()
p.value = c()
for(elem in 1:ncol(co)){
sub_inds <- factors %in% c(as.character(co[1,elem]),as.character(co[2,elem]))
resp <- as.matrix(x)[sub_inds,sub_inds]
ad = adonis2(as.dist(resp) ~
factors[sub_inds], strata=stratum[sub_inds], permutations=perm);
pairs = c(pairs,paste(co[1,elem],'vs',co[2,elem]));
F.Model =c(F.Model,ad$F[1]);
R2 = c(R2,ad$R2[1]);
p.value = c(p.value,ad$`Pr(>F)`[1])
}
p.adjusted = p.adjust(p.value,method=p.adjust.m)
pairw.res = data.frame(pairs,F.Model,R2,p.value,p.adjusted)
return(pairw.res)
}
# Calculate distance matrices
ps_bc <- phyloseq::distance(ps_object, method = "bray")
ps_euc <- phyloseq::distance(ps_object, method = "euclidean")
PERMANOVAs
PERMANOVA’s with Adonis - Permutational Multivariate Analysis of
Variance. This measures dissimilarity in response to one or more factors
in an analysis of variance design. Dissimilarity statistics are used to
measure distances between data points and test whether they are
significantly different.
We run the Adonis test specifying the treatment groups in the model
formula.
Bray-curtis
Bray-Curtis dissimilarity examines the abundances of ASVs that are
shared between two samples, and the number of ASVs found in each.
Bray-Curtis dissimilarity ranges from 0-1. If 0, the two samples share
all the same ASVs; if 1, they don’t share any ASVs.
Permanova result
The value under the column labeled ‘Pr(>F)’ in the results table
indicates significance (i.e. p-value)
#make a dataframe out of the sample data
sampledf <- data.frame(sample_data(ps_object))
adonis2(ps_bc ~ treatment, data = sampledf)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = ps_bc ~ treatment, data = sampledf)
Df SumOfSqs R2 F Pr(>F)
treatment 3 1.026 0.02703 1.7409 0.062 .
Residual 188 36.936 0.97297
Total 191 37.962 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Euclidean
Measures the distance between two samples in Euclidean space (the
length of the line segment between them).
Permanova result
The value under the column labeled ‘Pr(>F)’ in the results table
indicates significance (i.e. p-value)
adonis2(ps_euc ~ treatment, data = sampledf)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = ps_euc ~ treatment, data = sampledf)
Df SumOfSqs R2 F Pr(>F)
treatment 3 3.0164e+10 0.03385 2.1953 0.008 **
Residual 188 8.6105e+11 0.96615
Total 191 8.9121e+11 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Pairwise comparisons
A multi-level Adonis test using the FDR multiple test correction.
Bray-curtis pairwise result
pairwise.adonis.dm(ps_bc, sampledf$treatment, p.adjust.m = "fdr")
NA
Euclidean pairwise result
pairwise.adonis.dm(ps_euc, sampledf$treatment, p.adjust.m = "fdr")
NA
Beta diversiy PCoA and NMDS plots
PCoA and NMDS plots are used to visualize the
similarities/dissimilarities of sample points using their calculated
distance/dissimilarity matrices (from the Bray-curtis and Euclidean
distance measurements). Samples that are closer are more related and
vice versa. PCoA plots maximize the linear correlation between samples,
wherein NMDS plots maximize the rank-order correlation between samples.
Additionally, in case of NMDS, data is not required to fit a normal
distribution.
Images of each plot can be found in the
stats_analysis/beta_diversity_plots/ directory.
Bray-curtis PCoA
# Bray-curtis
#PCOA BC
ps_bc_pcoa <- ordinate(ps_object, distance = ps_bc, "PCoA")
ps_bc_pcoa_plot <- plot_ordination(ps_object, ps_bc_pcoa, color = "treatment") +
geom_point(size = 3) +
theme_classic() +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
theme(plot.title = element_text(hjust = 0.5))
ps_bc_pcoa_plot

#save as tiff
ggsave("bray_pcoa_plot.tiff", dpi = 300, plot = ps_bc_pcoa_plot, path = "./beta_diversity_plots/")
Bray-curtis NMDS
#NMDS BC
ps_bc_nmds <- ordinate(ps_object, distance = ps_bc, "NMDS")
ps_bc_nmds_plot <- plot_ordination(ps_object, ps_bc_nmds, color = "treatment") +
geom_point(size = 3) +
theme_classic() +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
theme(plot.title = element_text(hjust = 0.5))
ps_bc_nmds_plot

#save as tiff
ggsave("bray_nmds_plot.tiff", dpi = 300, plot = ps_bc_nmds_plot, path = "./beta_diversity_plots/")
Euclidean PCoA
# Euclidean
#PCOA EUC
ps_euc_pcoa <- ordinate(ps_object, distance = ps_euc, "PCoA")
ps_euc_pcoa_plot <- plot_ordination(ps_object, ps_euc_pcoa, color = "treatment") +
geom_point(size = 3) +
theme_classic() +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
theme(plot.title = element_text(hjust = 0.5))
ps_euc_pcoa_plot

#save as tiff
ggsave("euc_pcoa_plot.tiff", dpi = 300, plot = ps_euc_pcoa_plot, path = "./beta_diversity_plots/")
Euclidean NMDS
#NMDS BC
ps_euc_nmds <- ordinate(ps_object, distance = ps_euc, "NMDS")
ps_euc_nmds_plot <- plot_ordination(ps_object, ps_euc_nmds, color = "treatment") +
geom_point(size = 3) +
theme_classic() +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
theme(plot.title = element_text(hjust = 0.5))
ps_euc_nmds_plot

#save as tiff
ggsave("euc_nmds_plot.tiff", dpi = 300, plot = ps_euc_nmds_plot, path = "./beta_diversity_plots/")
Relative Abundance Taxa Plots
Relative abundance taxa bar plots for treatment
Plots were made at the phylum, genus, and species level, comparing
across the 4 treatments.
Images of each plot can be found in the stats_analysis/taxaplots/
directory.
Phylum
# transform to relative abundance
ps_rel <- transform(ps_object, "compositional")
# melt the data into a table
ps_rel_melt<- ps_rel %>%
psmelt()
# Get phylum with mean relative abundance across all samples
ps_rel_phylum_sum <- ps_rel_melt%>% group_by(Rank3) %>% dplyr::summarise(Aver = mean(Abundance))
names_ps_rel_phylum <- ps_rel_phylum_sum$Rank3
names_ps_rel_phylum
#Phylum Plot
taxaplot_ps_rel_phylum = ggplot(ps_rel_melt
, aes(x = treatment, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank3)) +
scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
"#5E738F","#D1A33D", "#8A7C64","lightgreen","aquamarine4",
"aquamarine2", "lightsalmon", "#CD9BCD")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_phylum <- taxaplot_ps_rel_phylum + guides(fill=guide_legend(title="Phylum"))
taxaplot_ps_rel_phylum

ggsave("taxaplot_phylum.tiff", dpi = 300, plot = taxaplot_ps_rel_phylum, path = "./taxaplots/")
Genus
Genera with a relative abundance lower than 0.0001 were grouped
together.
#Get genera with mean realtive abundance >0.0001 across all samples
ps_rel_genus_sum <- ps_rel_melt%>% group_by(Rank7) %>% dplyr::summarise(Aver = mean(Abundance))
ps_rel_genus_sub <- ps_rel_genus_sum[which(ps_rel_genus_sum$Aver > 0.0001),]
names_ps_rel_genus <- ps_rel_genus_sub$Rank7
names_ps_rel_genus
# Replace genera with <0.001 abundance with "NA"
ps_rel_melt$Rank7[ps_rel_melt$Rank7 != "Abiotrophia" &
ps_rel_melt$Rank7 != "Aerococcus" &
ps_rel_melt$Rank7 != "Dialister" &
ps_rel_melt$Rank7 != "Granulicatella" &
ps_rel_melt$Rank7 != "Ignavibacterium" &
ps_rel_melt$Rank7 != "Lacticaseibacillus" &
ps_rel_melt$Rank7 != "Lacticaseibacillus" &
ps_rel_melt$Rank7 != "Lactiplantibacillus" &
ps_rel_melt$Rank7 != "Lactobacillus" &
ps_rel_melt$Rank7 != "Lancefieldella" &
ps_rel_melt$Rank7 != "Lentilactobacillus" &
ps_rel_melt$Rank7 != "Levilactobacillus" &
ps_rel_melt$Rank7 != "Limosilactobacillus" &
ps_rel_melt$Rank7 != "Megasphaera" &
ps_rel_melt$Rank7 != "Moryella" &
ps_rel_melt$Rank7 != "Shuttleworthia" &
ps_rel_melt$Rank7 != "Solobacterium" &
ps_rel_melt$Rank7 != "Staphylococcus" &
ps_rel_melt$Rank7 != "Streptococcus" &
ps_rel_melt$Rank7 != "Veillonella"] <- NA
#replace NA with "Rel. Abund.<0.0001"
ps_rel_melt[is.na(ps_rel_melt)]<-"Rel. Abund.<0.0001"
#Genera plot
taxaplot_ps_rel_genus = ggplot(ps_rel_melt
, aes(x = treatment, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank7)) +
scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
"#5E738F","#D1A33D", "#8A7C64","lightgreen","aquamarine4",
"aquamarine2", "lightsalmon", "#CD9BCD")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_genus <- taxaplot_ps_rel_genus + guides(fill=guide_legend(title="Genus"))
taxaplot_ps_rel_genus

ggsave("taxaplot_genus.tiff", dpi = 300, plot = taxaplot_ps_rel_genus, path = "./taxaplots/")
Species
Species with a relative abundance lower than 0.01 were grouped
together.
#Get species with mean realtive abundance >0.01 across all samples
ps_rel_species_sum <- ps_rel_melt%>% group_by(Rank9) %>% dplyr::summarise(Aver = mean(Abundance))
ps_rel_species_sub <- ps_rel_species_sum[which(ps_rel_species_sum$Aver > 0.01),]
names_ps_rel_species <- ps_rel_species_sub$Rank9
names_ps_rel_species
# Replace genera with <0.001 abundance with "NA"
# Replace genera with <0.001 abundance with "NA"
ps_rel_melt$Rank9[ps_rel_melt$Rank9 != "Lactobacillus crispatus" &
ps_rel_melt$Rank9 != "Lactobacillus jensenii" &
ps_rel_melt$Rank9 != "Limosilactobacillus fermentum" &
ps_rel_melt$Rank9 != "Limosilactobacillus vaginalis" &
ps_rel_melt$Rank9 != "Megasphaera micronuciformis" &
ps_rel_melt$Rank9 != "Streptococcus anginosus" &
ps_rel_melt$Rank9 != "Streptococcus mutans" &
ps_rel_melt$Rank9 != "Streptococcus salivarius" &
ps_rel_melt$Rank9 != "Streptococcus vestibularis" &
ps_rel_melt$Rank9 != "Veillonella atypica" &
ps_rel_melt$Rank9 != "Veillonella dispar" &
ps_rel_melt$Rank9 != "Veillonella genomosp. P1 oral clone MB5_P17" &
ps_rel_melt$Rank9 != "Veillonella sp. oral taxon 158"] <- NA
#replace NA with "Rel. Abund.<0.01"
ps_rel_melt[is.na(ps_rel_melt)]<-"Rel. Abund.<0.01"
#Species plot
taxaplot_ps_rel_species = ggplot(ps_rel_melt
, aes(x = treatment, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank9)) +
scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
"#5E738F", "#D1A33D")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_species <- taxaplot_ps_rel_species + guides(fill=guide_legend(title="Species"))
taxaplot_ps_rel_species

ggsave("taxaplot_species.tiff", dpi = 300, plot = taxaplot_ps_rel_species, path = "./taxaplots/")
Relative abundance top 10 taxa bar plots per sample
Plots were made at the phylum, genus, and species level, comparing
across all samples.
Images of each plot can be found in the stats_analysis/taxaplots/
directory.
Top 10 Phyla
# transform to relative abundance
ps_rel <- transform(ps_object, "compositional")
#Top 10 PHYLA
#Get phylum level
ps_phylum <- tax_glom(ps_rel, taxrank= "Rank3")
#Get top 10 phylum
phylum_top10 <- names(sort(taxa_sums(ps_phylum), decreasing=TRUE))[1:10]
ps_phylum_top10 <- prune_taxa(phylum_top10, ps_phylum)
# melt the data into a table
ps_phylum_melt<- ps_phylum_top10 %>%
psmelt()
# Plot top 10 phylum by sample
taxaplot_ps_top10_phylum = ggplot(ps_phylum_melt
, aes(x = sample, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank3)) +
scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_phylum <- taxaplot_ps_top10_phylum + guides(fill=guide_legend(title="Top 10 Phyla"))
taxaplot_ps_top10_phylum

ggsave("taxaplot_top10_phylum.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_phylum, path = "./taxaplots/")
Top 10 Genera
#Top 10 GENERA
#Get genus level
ps_genus <- tax_glom(ps_rel, taxrank= "Rank7")
#Get top 10 genera
genus_top10 <- names(sort(taxa_sums(ps_genus), decreasing=TRUE))[1:10]
ps_genus_top10 <- prune_taxa(genus_top10, ps_genus)
# melt the data into a table
ps_genus_melt<- ps_genus_top10 %>%
psmelt()
#head(ps_genus_melt)
# Plot top 10 phylum by sample
taxaplot_ps_top10_genus = ggplot(ps_genus_melt
, aes(x = sample, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank7)) +
scale_fill_manual(values=c("aquamarine4", "gold3","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_genus <- taxaplot_ps_top10_genus + guides(fill=guide_legend(title="Top 10 Genera"))
taxaplot_ps_top10_genus

ggsave("taxaplot_top10_genus.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_genus, path = "./taxaplots/")
Top 10 Species
#Top 10 SPECIES
#Get species level
ps_species <- tax_glom(ps_rel, taxrank= "Rank9")
#Get top 10 genera
species_top10 <- names(sort(taxa_sums(ps_species), decreasing=TRUE))[1:10]
ps_species_top10 <- prune_taxa(species_top10, ps_species)
# melt the data into a table
ps_species_melt<- ps_species_top10 %>%
psmelt()
#head(ps_species_melt)
# Plot top 10 phylum by sample
taxaplot_ps_top10_species = ggplot(ps_species_melt
, aes(x = sample, y=Abundance)) +
geom_bar(stat="identity", position="fill", aes(fill = Rank9)) +
scale_fill_manual(values=c("aquamarine4", "gold3","lightpink", "firebrick","#DA5724","ivory4",
"orchid", "#CBD588", "#8569D5", "#D14285")) +
theme_bw() +
ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_species <- taxaplot_ps_top10_species + guides(fill=guide_legend(title="Top 10 Species"))
taxaplot_ps_top10_species

ggsave("taxaplot_top10_species.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_species, path = "./taxaplots/")
---
title: "16S ONT Sequence and Statistical Analysis"
output:
  html_notebook:
    code_folding: hide
    toc: true
    toc_float: true
  word_document:
    toc: true
  html_document:
    toc: true
    df_print: paged
  pdf_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## ONT 16S Sequencing Files
Raw fast5 files from each Oxford Nanopore (ONT) sequencing batch can be found within the directory named: raw_fast5_files/

| A total of 8 sequencing batches were run. Each batch includes 24 samples from the following treatment and donor ids:

|       1. UT1, UT2, UT3

|       2. UT4, UT5, UT6

|       3. PLA1, PLA2, PLA3

|       4. PLA4, PLA5, PLA6 *NOTE: These samples had to be re-run so they represent two separate sequencing runs

|       5. T1-1, T1-2, T1-3

|       6. T1-4, T1-5, T1-6

|       7. T2-1, T2-2, T2-3

|       8. T2-4, T2-5, T2-6

| So for example, the raw fast5 files of samples within UT1, UT2, UT3 would be in: raw_fast5_files/UT1_UT2_UT3_fast5s/

## ONT 16S Sequence QC Pipeline

| After ONT sequencing the following pipeline was used to demultiplex and quality filter reads:

| 1. Basecall with Dorado V 0.4.1 https://github.com/nanoporetech/dorado

| 2. Demultiplex with Guppy V 6.4.6 (guppy_barcoder: https://community.nanoporetech.com/docs/prepare/library_prep_protocols/Guppy-protocol/v/gpb_2003_v1_revax_14dec2018/barcoding-demultiplexing) 

| 3. Read quality filtering and trimming with Chopper V 0.6.0 https://github.com/wdecoster/chopper 
|    Trimmed fastq files can be found in the directory named: trimmed_fastqs

|       Chopper parameters used:

|           •	Min q score: 10

|           •	Min length: 1,000 bp 

|           •	Head crop: 50

|           •	Max length: 1690

| 4. Multiqc results summary using fastp qc V 0.20.1 
|    Multiqc results for each sequencing batch can be found in the directory named: multiqc

| Full pipeline with code is available at: https://github.com/MessyaszA/ONT_demux 

### Taxonomic Classification
Quality filtered and trimmed reads were then aligned to the HOMD database and biom files were created for import into phyloseq.

o	First, built a Kraken2 database of the 16S HOMD sequences (https://www.homd.org/ftp/16S_rRNA_refseq/HOMD_16S_rRNA_RefSeq/current/HOMD_16S_rRNA_RefSeq_V15.23.fasta) 

| 1. Taxonomically classified reads via Kraken2 to the built HOMD 16S database

| 2. Estimated abundance of taxa via Bracken https://ccb.jhu.edu/software/bracken/ 
|       Krona graphs of the Bracken abundance estimates for each sample can be found in the directory named: krona_bracken

| 3. Exported Bracken results as a biom file for import into phlyoseq (via taxpasta V 0.6.1 https://taxpasta.readthedocs.io/en/latest/) 

| Full pipeline with code for these 3 steps is available at: https://github.com/MessyaszA/bracken_taxpasta 

## Statistical Analysis
Diversity, differential abundance, and relative abundance of taxa were analyzed via Phyloseq and statistical tests were run between the 4 treatments. 

The biom file and metadata table imported into phyloseq can be found in the directory named: stats_analysis

Results include:

| o Alpha diversity metrics - Observed, Shannon, Simpson (csv tables in stats_analysis directory)

| o	Alpha diversity statistical tests (results below)

| o	Alpha diversity boxplots (tiff images in stats_analysis/alpha_diversity_plots directory)

|       •	*NOTE: Observed richness was run on raw and normalized data. Rarefaction to lowest read number was used as normalization.
|       This was included for comparative purposes. Other normalization methods can also be run if requested.

| o	Pairwise differential abundance test via DeSeq2 (csv table results in stats_analysis/DESEQ2 directory)

|       •	Volcano plot (tiff image in stats_analysis/DESEQ2 directory)

|       •	*NOTE: Another differential abundance method for microbiome data (ANCOM-BC
|       https://www.bioconductor.org/packages/release/bioc/html/ANCOMBC.html ) can be run if requested.


| o	Beta diversity statistical tests - Bray-Curtis, Euclidean distances

|       •	*NOTE: Methods for normalizing beta diversity metrics (i.e. calculating Aitchison distance via the deicode package in Qiime2
|       https://library.qiime2.org/plugins/deicode/19/) can be run if requested. 

| o	Beta diversity plots (tiff images in stats_analysis/beta_diversity_plots directory)

| o	Relative abundance taxa plots (tiff images can be found in stats_analysis/taxaplots directory)
```{r Libraries,  message=FALSE, warning=FALSE}

#Libraries we will use
library("DESeq2")
library("ggplot2")
library(Biostrings)
library("phyloseq")
library("microbiome")
library("ggthemes")
library(ggnetwork)
library(cowplot)
library("vegan")
library(magrittr)
library(dplyr)
library(EnhancedVolcano)

#Create a few output directories that will hold the output files. 
dir.create('./DESEQ2')
dir.create('./ANCOM')
```

```{r data_load1, warning=FALSE}
# Load data and add metadata
ps_object <- import_biom("./HOMD_biom_merge.biom")

#import medata and add to phyloseq object
metadata <- read.csv("./metadata.csv", row.names = 1)
metadata = sample_data(data.frame(metadata))
sample_data(ps_object) <- metadata
```
## Statistical Analysis Results

### Total number of taxa
The total number of taxonomically identified ASVs across all samples
```{r data_load2, warning=FALSE}
#Check sample numbers
ntaxa(ps_object) #Total taxa
```
### Number of taxonomically classified reads per sample
Samples will have varying numbers of taxonomically classified reads due to different sampling depths during sequencing. 
```{r data_load3, warning=FALSE}
#Check sample numbers
sample_sums(ps_object) #Reads per sample
```
## Alpha Diversity
Measuring species diversity within each sample.
```{r functions}
#Function to calculate the mean and the standard deviation for each group.
#data : a data frame
#varname : the name of a column containing the variable to be summarized
#groupnames : vector of column names to be used as grouping variables

# Calculate standard error
sderr <- function(x) {sd(x)/sqrt(length(x))}

data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sderr(x[[col]]), na.rm=TRUE)
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
  return(data_sum)
}
```

### Data table with alpha diversity metrics: Observed, Shannon, Simpson
#### Example of data table structure (full table saved as csv)
Three alpha diversity metrics are calculated for each sample.
The full table (per_sample_ad_dataframe.csv) can be found in the directory named: stats_analysis
```{r data_table, warning = FALSE, message=FALSE}
alpha_div <- cbind(estimate_richness(ps_object, 
                                                   measures = c('Observed', 'Shannon', 'Simpson')),
                                 sample_data(ps_object))
colnames(alpha_div) <- c('Observed', 'Shannon', 'Simpson')
alpha_div$Labels <- rownames(alpha_div)
ad_df <- alpha_div[,c('Observed', 'Shannon', 'Simpson')]
ad_df <- cbind(ad_df,
                              sample_data(ps_object)[, c('sample', 'group', 'donor', 'treatment')])
colnames(ad_df) <- c('Observed', 'Shannon', 'Simpson', 'sample', 'group', 'donor', 'treatment')
head(ad_df)
write.csv(ad_df, "./per_sample_ad_dataframe.csv")
```
### Data Summary statistics: Observed, Shannon, Simpson
#### Summary for each treatment
The average of each alpha diversity metric for all samples within a treatment. 
```{r summary_stats, warning = FALSE, message=FALSE}
# Observed
data_summary(ad_df, varname = "Observed", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Observed", groupnames = c("sample"))
#data_summary(ad_df, varname = "Observed", groupnames = c("donor"))

# Shannon
data_summary(ad_df, varname = "Shannon", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Shannon", groupnames = c("sample"))
#data_summary(ad_df, varname = "Shannon", groupnames = c("donor"))

# Simpson
data_summary(ad_df, varname = "Simpson", groupnames = c("treatment"))
#data_summary(ad_df, varname = "Simpson", groupnames = c("sample"))
#data_summary(ad_df, varname = "Simpson", groupnames = c("donor"))
```
### Alpha diversity statistical tests for significance
#### Kruskal Wallis and Pairwise Wilcoxon tests
Kruskal-Wallis Rank Sum Test for a category (treatment) on each Alpha Diversity metric calculated. This is a non-parametric method which ranks the data and tests whether samples from each group analyzed come from the same distribution. 

Pairwise Wilcoxon Rank Sum Test between groups within a category (treatment) on each Alpha Diversity metric calculated and corrected for multiple testing. This methods compares two independent samples for all group levels and includes the false discovery rate (FDR) multiple test correction. The results below come up as a matrix filled with the p-values for each treatment comparison.

#### Observed Richness 
Testing ASV/species richness between the 4 treatments.
```{r observed_ad, warning = FALSE, message=FALSE}
#### Observed
#Kruskal Wallis tests
kruskal.test(Observed ~ treatment, data = ad_df)
#kruskal.test(Observed ~ sample, data = ad_df)
#kruskal.test(Observed ~ donor, data = ad_df)

#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Observed, ad_df$treatment, p.adjust.method = 'fdr')
```
#### Shannon Diversity
Testing ASV/species richness between the 4 treatments while also taking into account relative abundance (evenness).
```{r shannon_ad, warning = FALSE, message=FALSE}
#### Shannon
#Kruskal Wallis tests
kruskal.test(Shannon ~ treatment, data = ad_df)
#kruskal.test(Shannon ~ sample, data = ad_df)
#kruskal.test(Shannon ~ donor, data = ad_df)

#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Shannon, ad_df$treatment, p.adjust.method = 'fdr')
```
#### Simpson Diversity
Testing ASV/species richness between the 4 treatments while also taking into account relative abundance (evenness). Simpson diversity gives more weight to common or dominant ASVs (rare ASVs will not affect diversity).
```{r simpson_ad, warning = FALSE, message=FALSE}
#### Simpson
#Kruskal Wallis tests
kruskal.test(Simpson ~ treatment, data = ad_df)
#kruskal.test(Simpson ~ sample, data = ad_df)
#kruskal.test(Simpson ~ donor, data = ad_df)

#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df$Simpson, ad_df$treatment, p.adjust.method = 'fdr')
```

### Alpha diversity boxplots
Images of each plot can be found in: stats_analysis/alpha_diversity_plots
```{r AD, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=6, fig.width=7}
###Group plots
#Observed
obs_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Observed, color = treatment)) + geom_boxplot() + geom_point() + ylab('Observed Richness') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))

obs_ad_plot

ggsave("obs_ad_plot.tiff", dpi = 300, plot = obs_ad_plot, path = "./alpha_diversity_plots/")

#Shannon
shan_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Shannon, color = treatment)) + geom_boxplot() + geom_point() + ylab('Shannon Diversity') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))

shan_ad_plot

ggsave("shan_ad_plot.tiff", dpi = 300, plot = shan_ad_plot, path = "./alpha_diversity_plots/")

#Simpson
simp_ad_plot <- ggplot(ad_df, aes(x=treatment, y=Simpson, color = treatment)) + geom_boxplot() + geom_point() + ylab('Simpson Diversity') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))

simp_ad_plot

ggsave("simp_ad_plot.tiff", dpi = 300, plot = simp_ad_plot, path = "./alpha_diversity_plots/")
```
## Alpha Diversity on Rarefied Data
### Normalize Observed Richness metric by rarefying to lowest sample sum number
Shannon and Simpson diversity indices are already normalized by relative abundance. 

### Rarefaction read depth
One method of normalization is to rarefy each sample to the minimum read depth found across all samples. The minimum read depth for all samples is: 
```{r rare_num, warning=FALSE}
min(sample_sums(ps_object)) #Minimum reads (used for rarefying)
```
#### Example of datatable structure for rarefied data observed richness (full table saved as csv)
Observed richness is calculated for each sample now rarefied to the minimum read depth.
The full table (rarefied_per_sample_ad_dataframe.csv) can be found in the directory named: stats_analysis
```{r AD_rarefied, warning=FALSE, message=FALSE}
#Rarefy to lowest sample_sums
rare_num <- min(sample_sums(ps_object))
ps_rare <- rarefy_even_depth(ps_object, sample.size = rare_num, rngseed = 999)

#Get datatable with Observed richness for rarefied samples
alpha_div_rare <- cbind(estimate_richness(ps_rare, 
                                                   measures = c('Observed')),
                                 sample_data(ps_rare))
colnames(alpha_div_rare) <- c('Observed_rare')
alpha_div_rare$Labels <- rownames(alpha_div_rare)
ad_df_rare <- alpha_div_rare[,c('Observed_rare')]
ad_df_rare <- cbind(ad_df_rare,
                              sample_data(ps_rare)[, c('sample', 'group', 'donor', 'treatment')])
colnames(ad_df_rare) <- c('Observed_rare', 'sample', 'group', 'donor', 'treatment')
head(ad_df_rare)
write.csv(ad_df_rare, "./rarefied_per_sample_ad_dataframe.csv")
```
#### Data summary for rarefied data (observed richness)
The average observed richness for all samples within a treatment. 
```{r ad_rare_summary, warning=FALSE, message=FALSE}
# Observed Rarefied data summary stats
data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("treatment"))
#data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("sample"))
#data_summary(ad_df_rare, varname = "Observed_rare", groupnames = c("donor"))
```

#### Observed richness statistical tests for rarefied data
Testing ASV/species richness between the 4 treatments.
```{r ad_rare_stats, warning=FALSE, message=FALSE}

#### Observed Rarefied
#Kruskal Wallis tests
kruskal.test(Observed_rare ~ treatment, data = ad_df_rare)
#kruskal.test(Observed_rare ~ sample, data = ad_df_rare)
#kruskal.test(Observed_rare ~ donor, data = ad_df_rare)

#Pairwise Wilcoxon Test
pairwise.wilcox.test(ad_df_rare$Observed_rare, ad_df_rare$treatment, p.adjust.method = 'fdr')
```
#### Rarefied data observed richness boxplot
Images of the plot can be found in: stats_analysis/alpha_diversity_plots
```{r ad_rare_plot, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=6, fig.width=7}
#Plots: Observed Rarefied
obs_ad_plot_rare <- ggplot(ad_df_rare, aes(x=treatment, y=Observed_rare, color = treatment)) + geom_boxplot() + geom_point() + ylab('Observed Richness Rarefied Data') + scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F"))

obs_ad_plot_rare

ggsave("rarefied_obs_ad_plot.tiff", dpi = 300, plot = obs_ad_plot_rare, path = "./alpha_diversity_plots/")
```

## Differential Abundance Analysis
### Pairwise differential abundance test between treatments via DESeq2
The package DESeq2 provides methods to test for differential abundance by use of negative binomial generalized linear models. DESeq2 uses the Wald test to take into account the dispersion model distance and negative binomial distribution we see in sequencing data. Overall this method statistically infers systematic changes between conditions, as compared to within-condition variability. It estimates dispersion and logarithmic fold changes to test which sequences/ASVs are significantly changing within the experimental design. 

Additionally, a p-value correction is applied for multiple hypothesis testing. Traditionally p < 0.05 is considered significant (95% chance it is not by random chance), but when you are looking at thousands of ASVs/sequences, you'll still end up with 5% of them significant off predicted chance. One of the best corrections to use called False-Discovery Rate correction (FDR), which is the Benjamini and Hochberg method.

Here we ran DESeq2 specifying only treatment in the experimental design. Log fold changes and adjusted p-valus (padj) for all ASVs and significant ASVs have been saved as csv tables in the stats_analysis/DESEQ2 directory (named dds_result_table.csv and sig_result_table.csv respectively).

The below plot shows the fitted curve (red line) used to calculate the final dispersion estimates (blue dots) of the input data (black dots). 
```{r Deseq, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=9}
#Phyloseq to Deseq2
ps_deseq = phyloseq_to_deseq2(ps_object, ~treatment)

#Run Deseq2
dds = DESeq(ps_deseq, test="Wald", fitType="local")
dds_result <- results(dds)
dds_result_table <- cbind(as(dds_result, "data.frame"), as(tax_table(ps_object)[rownames(dds_result), ], "matrix"))
write.csv(dds_result_table, "./DESEQ2/dds_result_table.csv")

#check significant results
dds_sig = dds_result[which(dds_result$padj < 0.05), ]
dds_sig_table = cbind(as(dds_sig, "data.frame"), as(tax_table(ps_object)[rownames(dds_sig), ], "matrix"))
dim(dds_sig_table)
write.csv(dds_sig_table, "./DESEQ2/sig_result_table.csv")

#Perform the variance stabilizing transformation on the data and pull the normalized hit count matrix. 
vst <- varianceStabilizingTransformation(dds)
mat <- assay(vst)

#If you have a batch designed in, regress the batch effects. 
#mat <- limma::removeBatchEffect(mat, vst$batch)
#assay(vst) <- mat

#Plot the overall dispersion of the experiment to ensure it aligns with expectations. 
plotDispEsts(dds)
```

### Volcano plot: DeSeq2 results
Volcano plots are a great way to visualize the pairwise comparisons. Each shows the log2 fold-change (log2FC) on the x-axis and -log10(padj) on the y-axis. This should usually look akin to a Plinian eruption (hence volcano plot).

Image of the volcano plot can be found in the stats_analysis/DESEQ2/ directory
```{r Volcanoes, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=9}
p <- EnhancedVolcano(dds_result_table,
        lab=as.character(dds_result_table$Rank9),
        title="Pairwise comparisons of taxa (species level) between treatments",
        x='log2FoldChange',
        y='padj',
        legendPosition = 'bottom',
        legendLabels=c('Not sig.','Log (base 2) FC','padj-value','padj-value & Log (base 2) FC'),
        pCutoff=0.05, 
        FCcutoff=1
        )
print(p)
ggsave("volcano_plot.tiff", dpi = 300, plot = p, path = "./DESEQ2/")
```
## Beta Diversity
Measuring species diversity or the level or similarity/dissimilarity of species between samples.
```{r}
#Pairwise Adonis function
pairwise.adonis.dm <- function(x,factors,stratum=NULL,p.adjust.m="fdr",perm=999){
  
  library(vegan)
  if(class(x) != "dist") stop("x must be a dissimilarity matrix (dist object)")
  co = as.matrix(combn(unique(factors),2))
  pairs = c()
  F.Model =c()
  R2 = c()
  p.value = c()
  
  
  
  for(elem in 1:ncol(co)){
    sub_inds <- factors %in% c(as.character(co[1,elem]),as.character(co[2,elem]))
    resp <- as.matrix(x)[sub_inds,sub_inds]
    ad = adonis2(as.dist(resp) ~
                  
                  factors[sub_inds], strata=stratum[sub_inds], permutations=perm);
    
    pairs = c(pairs,paste(co[1,elem],'vs',co[2,elem]));
    F.Model =c(F.Model,ad$F[1]);
    R2 = c(R2,ad$R2[1]);
    p.value = c(p.value,ad$`Pr(>F)`[1])
    
  }
  
  p.adjusted = p.adjust(p.value,method=p.adjust.m)
  pairw.res = data.frame(pairs,F.Model,R2,p.value,p.adjusted)
  return(pairw.res)
}
```

```{r}
# Calculate distance matrices
ps_bc <- phyloseq::distance(ps_object, method = "bray")
ps_euc <- phyloseq::distance(ps_object, method = "euclidean")
```
### PERMANOVAs
PERMANOVA's with Adonis -  Permutational Multivariate Analysis of Variance. This measures dissimilarity in response to one or more factors in an analysis of variance design. Dissimilarity statistics are used to measure distances between data points and test whether they are significantly different.

We run the Adonis test specifying the treatment groups in the model formula. 

#### Bray-curtis
Bray-Curtis dissimilarity examines the abundances of ASVs that are shared between two samples, and the number of ASVs found in each. Bray-Curtis dissimilarity ranges from 0-1. If 0, the two samples share all the same ASVs; if 1, they don’t share any ASVs.

#### Permanova result
The value under the column labeled 'Pr(>F)' in the results table indicates significance (i.e. p-value)
```{r}
#make a dataframe out of the sample data
sampledf <- data.frame(sample_data(ps_object))

adonis2(ps_bc ~ treatment, data = sampledf) 
```
#### Euclidean
Measures the distance between two samples in Euclidean space (the length of the line segment between them).

#### Permanova result
The value under the column labeled 'Pr(>F)' in the results table indicates significance (i.e. p-value)
```{r}
adonis2(ps_euc ~ treatment, data = sampledf) 
```

### Pairwise comparisons
A multi-level Adonis test using the FDR multiple test correction. 

#### Bray-curtis pairwise result
```{r}
pairwise.adonis.dm(ps_bc, sampledf$treatment, p.adjust.m = "fdr")

```
#### Euclidean pairwise result
```{r}
pairwise.adonis.dm(ps_euc, sampledf$treatment, p.adjust.m = "fdr") 

```
### Beta diversiy PCoA and NMDS plots
PCoA and NMDS plots are used to visualize the similarities/dissimilarities of sample points using their calculated distance/dissimilarity matrices (from the Bray-curtis and Euclidean distance measurements). Samples that are closer are more related and vice versa. 
PCoA plots maximize the linear correlation between samples, wherein NMDS plots maximize the rank-order correlation between samples. Additionally, in case of NMDS, data is not required to fit a normal distribution.

Images of each plot can be found in the stats_analysis/beta_diversity_plots/ directory. 

#### Bray-curtis PCoA
```{r BD_bray, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=7, fig.width=8}
# Bray-curtis
#PCOA BC
ps_bc_pcoa <- ordinate(ps_object, distance = ps_bc, "PCoA")

ps_bc_pcoa_plot <- plot_ordination(ps_object, ps_bc_pcoa, color = "treatment") + 
  geom_point(size = 3) + 
  theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) + 
  scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
  theme(plot.title = element_text(hjust = 0.5))
ps_bc_pcoa_plot

#save as tiff
ggsave("bray_pcoa_plot.tiff", dpi = 300, plot = ps_bc_pcoa_plot, path = "./beta_diversity_plots/")
```
#### Bray-curtis NMDS
```{r BD_bray2, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=7, fig.width=8}
#NMDS BC
ps_bc_nmds <- ordinate(ps_object, distance = ps_bc, "NMDS")

ps_bc_nmds_plot <- plot_ordination(ps_object, ps_bc_nmds, color = "treatment") + 
  geom_point(size = 3) + 
  theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) + 
  scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
  theme(plot.title = element_text(hjust = 0.5))
ps_bc_nmds_plot

#save as tiff
ggsave("bray_nmds_plot.tiff", dpi = 300, plot = ps_bc_nmds_plot, path = "./beta_diversity_plots/")
```
#### Euclidean PCoA
```{r BD_euc, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=7, fig.width=8}

# Euclidean
#PCOA EUC
ps_euc_pcoa <- ordinate(ps_object, distance = ps_euc, "PCoA")

ps_euc_pcoa_plot <- plot_ordination(ps_object, ps_euc_pcoa, color = "treatment") + 
  geom_point(size = 3) + 
  theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) + 
  scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
  theme(plot.title = element_text(hjust = 0.5))
ps_euc_pcoa_plot

#save as tiff
ggsave("euc_pcoa_plot.tiff", dpi = 300, plot = ps_euc_pcoa_plot, path = "./beta_diversity_plots/")
```
#### Euclidean NMDS
```{r BD_euc2, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=7, fig.width=8}
#NMDS BC
ps_euc_nmds <- ordinate(ps_object, distance = ps_euc, "NMDS")

ps_euc_nmds_plot <- plot_ordination(ps_object, ps_euc_nmds, color = "treatment") + 
  geom_point(size = 3) + 
  theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) + 
  scale_colour_manual(values = c("#954535", "#702963", "#F28C28", "#088F8F")) +
  theme(plot.title = element_text(hjust = 0.5))
ps_euc_nmds_plot

#save as tiff
ggsave("euc_nmds_plot.tiff", dpi = 300, plot = ps_euc_nmds_plot, path = "./beta_diversity_plots/")
```

## Relative Abundance Taxa Plots
### Relative abundance taxa bar plots for treatment
Plots were made at the phylum, genus, and species level, comparing across the 4 treatments.  

Images of each plot can be found in the stats_analysis/taxaplots/ directory. 

#### Phylum
```{r Taxaplots_phy, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=7}
# transform to relative abundance
ps_rel <- transform(ps_object, "compositional")

# melt the data into a table
ps_rel_melt<- ps_rel %>%
  psmelt()

# Get phylum with mean relative abundance across all samples 
ps_rel_phylum_sum <- ps_rel_melt%>% group_by(Rank3) %>% dplyr::summarise(Aver = mean(Abundance))
names_ps_rel_phylum <- ps_rel_phylum_sum$Rank3
names_ps_rel_phylum

#Phylum Plot
taxaplot_ps_rel_phylum = ggplot(ps_rel_melt
                    , aes(x = treatment, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank3))  +
  scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
                             "#5E738F","#D1A33D", "#8A7C64","lightgreen","aquamarine4",
                             "aquamarine2", "lightsalmon", "#CD9BCD")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_phylum <- taxaplot_ps_rel_phylum + guides(fill=guide_legend(title="Phylum"))
taxaplot_ps_rel_phylum

ggsave("taxaplot_phylum.tiff", dpi = 300, plot = taxaplot_ps_rel_phylum, path = "./taxaplots/")
```
#### Genus
Genera with a relative abundance lower than 0.0001 were grouped together.
```{r Taxaplots_gen, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=7}

#Get genera with mean realtive abundance >0.0001 across all samples 
ps_rel_genus_sum <- ps_rel_melt%>% group_by(Rank7) %>% dplyr::summarise(Aver = mean(Abundance))
ps_rel_genus_sub <- ps_rel_genus_sum[which(ps_rel_genus_sum$Aver > 0.0001),]
names_ps_rel_genus <- ps_rel_genus_sub$Rank7
names_ps_rel_genus

# Replace genera with <0.001 abundance with "NA"
ps_rel_melt$Rank7[ps_rel_melt$Rank7 != "Abiotrophia" &
                           ps_rel_melt$Rank7 != "Aerococcus" &
                           ps_rel_melt$Rank7 != "Dialister" &
                           ps_rel_melt$Rank7 != "Granulicatella" &
                           ps_rel_melt$Rank7 != "Ignavibacterium" &
                           ps_rel_melt$Rank7 != "Lacticaseibacillus" &
                           ps_rel_melt$Rank7 != "Lacticaseibacillus" &
                           ps_rel_melt$Rank7 != "Lactiplantibacillus" &
                           ps_rel_melt$Rank7 != "Lactobacillus" &
                           ps_rel_melt$Rank7 != "Lancefieldella" &
                           ps_rel_melt$Rank7 != "Lentilactobacillus" &
                           ps_rel_melt$Rank7 != "Levilactobacillus" &
                           ps_rel_melt$Rank7 != "Limosilactobacillus" &
                           ps_rel_melt$Rank7 != "Megasphaera" &
                           ps_rel_melt$Rank7 != "Moryella" &
                           ps_rel_melt$Rank7 != "Shuttleworthia" &
                           ps_rel_melt$Rank7 != "Solobacterium" &
                           ps_rel_melt$Rank7 != "Staphylococcus" &
                           ps_rel_melt$Rank7 != "Streptococcus" &
                           ps_rel_melt$Rank7 != "Veillonella"] <- NA

#replace NA with "Rel. Abund.<0.0001"
ps_rel_melt[is.na(ps_rel_melt)]<-"Rel. Abund.<0.0001"
#Genera plot
taxaplot_ps_rel_genus = ggplot(ps_rel_melt
                    , aes(x = treatment, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank7))  +
  scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
                             "#5E738F","#D1A33D", "#8A7C64","lightgreen","aquamarine4",
                             "aquamarine2", "lightsalmon", "#CD9BCD")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_genus <- taxaplot_ps_rel_genus + guides(fill=guide_legend(title="Genus"))
taxaplot_ps_rel_genus
ggsave("taxaplot_genus.tiff", dpi = 300, plot = taxaplot_ps_rel_genus, path = "./taxaplots/")
```
#### Species
Species with a relative abundance lower than 0.01 were grouped together.
```{r Taxaplots_sp, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=7}
#Get species with mean realtive abundance >0.01 across all samples 
ps_rel_species_sum <- ps_rel_melt%>% group_by(Rank9) %>% dplyr::summarise(Aver = mean(Abundance))
ps_rel_species_sub <- ps_rel_species_sum[which(ps_rel_species_sum$Aver > 0.01),]
names_ps_rel_species <- ps_rel_species_sub$Rank9
names_ps_rel_species

# Replace genera with <0.001 abundance with "NA"
# Replace genera with <0.001 abundance with "NA"
ps_rel_melt$Rank9[ps_rel_melt$Rank9 != "Lactobacillus crispatus" &
                           ps_rel_melt$Rank9 != "Lactobacillus jensenii" &
                           ps_rel_melt$Rank9 != "Limosilactobacillus fermentum" &
                           ps_rel_melt$Rank9 != "Limosilactobacillus vaginalis" &
                           ps_rel_melt$Rank9 != "Megasphaera micronuciformis" &
                           ps_rel_melt$Rank9 != "Streptococcus anginosus" &
                           ps_rel_melt$Rank9 != "Streptococcus mutans" &
                           ps_rel_melt$Rank9 != "Streptococcus salivarius" &
                           ps_rel_melt$Rank9 != "Streptococcus vestibularis" &
                           ps_rel_melt$Rank9 != "Veillonella atypica" &
                           ps_rel_melt$Rank9 != "Veillonella dispar" &
                           ps_rel_melt$Rank9 != "Veillonella genomosp. P1 oral clone MB5_P17" &
                           ps_rel_melt$Rank9 != "Veillonella sp. oral taxon 158"] <- NA

#replace NA with "Rel. Abund.<0.01"
ps_rel_melt[is.na(ps_rel_melt)]<-"Rel. Abund.<0.01"
#Species plot
taxaplot_ps_rel_species = ggplot(ps_rel_melt
                    , aes(x = treatment, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank9))  +
  scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285", "#652926","grey80",
                             "#5E738F", "#D1A33D")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Treatment") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0))
taxaplot_ps_rel_species <- taxaplot_ps_rel_species + guides(fill=guide_legend(title="Species"))
taxaplot_ps_rel_species
ggsave("taxaplot_species.tiff", dpi = 300, plot = taxaplot_ps_rel_species, path = "./taxaplots/")

```

### Relative abundance top 10 taxa bar plots per sample
Plots were made at the phylum, genus, and species level, comparing across all samples.  

Images of each plot can be found in the stats_analysis/taxaplots/ directory. 

#### Top 10 Phyla
```{r Taxaplots2, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=18} 
# transform to relative abundance
ps_rel <- transform(ps_object, "compositional")

#Top 10 PHYLA
#Get phylum level
ps_phylum <- tax_glom(ps_rel, taxrank= "Rank3")
#Get top 10 phylum
phylum_top10 <- names(sort(taxa_sums(ps_phylum), decreasing=TRUE))[1:10]
ps_phylum_top10 <- prune_taxa(phylum_top10, ps_phylum)

# melt the data into a table
ps_phylum_melt<- ps_phylum_top10 %>%
  psmelt()

# Plot top 10 phylum by sample
taxaplot_ps_top10_phylum = ggplot(ps_phylum_melt
                    , aes(x = sample, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank3))  +
  scale_fill_manual(values=c("aquamarine4", "gold","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
  theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_phylum <- taxaplot_ps_top10_phylum + guides(fill=guide_legend(title="Top 10 Phyla"))
taxaplot_ps_top10_phylum

ggsave("taxaplot_top10_phylum.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_phylum, path = "./taxaplots/")
```
#### Top 10 Genera
```{r Taxaplots_top_gen, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=18}

#Top 10 GENERA
#Get genus level
ps_genus <- tax_glom(ps_rel, taxrank= "Rank7")
#Get top 10 genera
genus_top10 <- names(sort(taxa_sums(ps_genus), decreasing=TRUE))[1:10]
ps_genus_top10 <- prune_taxa(genus_top10, ps_genus)

# melt the data into a table
ps_genus_melt<- ps_genus_top10 %>%
  psmelt()
#head(ps_genus_melt)

# Plot top 10 phylum by sample
taxaplot_ps_top10_genus = ggplot(ps_genus_melt
                    , aes(x = sample, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank7))  +
  scale_fill_manual(values=c("aquamarine4", "gold3","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
  theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_genus <- taxaplot_ps_top10_genus + guides(fill=guide_legend(title="Top 10 Genera"))
taxaplot_ps_top10_genus

ggsave("taxaplot_top10_genus.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_genus, path = "./taxaplots/")
```
#### Top 10 Species
```{r Taxaplots_top_sp, warning=FALSE, message=FALSE, fig.align="center", results='hide', fig.height=8, fig.width=18}

#Top 10 SPECIES
#Get species level
ps_species <- tax_glom(ps_rel, taxrank= "Rank9")
#Get top 10 genera
species_top10 <- names(sort(taxa_sums(ps_species), decreasing=TRUE))[1:10]
ps_species_top10 <- prune_taxa(species_top10, ps_species)

# melt the data into a table
ps_species_melt<- ps_species_top10 %>%
  psmelt()
#head(ps_species_melt)

# Plot top 10 phylum by sample
taxaplot_ps_top10_species = ggplot(ps_species_melt
                    , aes(x = sample, y=Abundance)) + 
  geom_bar(stat="identity", position="fill", aes(fill = Rank9))  +
  scale_fill_manual(values=c("aquamarine4", "gold3","lightpink", "firebrick","#DA5724","ivory4",
                             "orchid", "#CBD588", "#8569D5", "#D14285")) +
  theme_bw() +
  ylab("Relative Abundance") + xlab("Sample") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0)) +
  theme(legend.title = element_text(size=9), legend.text = element_text(size=7))
taxaplot_ps_top10_species <- taxaplot_ps_top10_species + guides(fill=guide_legend(title="Top 10 Species"))
taxaplot_ps_top10_species

ggsave("taxaplot_top10_species.tiff", dpi = 300, width = 18, height = 8, units = "in", plot = taxaplot_ps_top10_species, path = "./taxaplots/")

```
